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Abstract 

We compute the singular values of an m x n tall and skinny (m ^ n) sparse matrix A without 
dependence on m, for very large m. In particular, we give a simple nonadaptive sampling scheme 
where the singular values of A are estimated within relative error with high probability. Our 
^ I proven bounds focus on the MapReduce framework, which has become the de facto tool for 

handling such large matrices that cannot be stored or even streamed through a single machine. 
On the way, we give a general method to compute A^ A. We preserve singular values of 
A'^A with e relative error with shuffle size 0{'n?' /e^) and reduce-key complexity 0{n/€^). We 
further show that if only specific entries of A^A are required, then we can reduce the shufHe 
V^ ' size to 0(nlog(n)/s) and reduce-key complexity to 0(log(n)/s), where s is the minimum cosine 

^^ , similarity for the entries being estimated. All of our bounds are independent of ?7i, the larger 

dimension. 

o 

m ■ 1 Introduction 

There has been a flurry of work to solve problems in numerical linear algebra via fast approximate 
k><( ■ randomized algorithms. Starting with [TB] many algorithms have been proposed over older algo- 

U '■ rithms [HllISllIlKinKISKIlKIIKITlElIMlllElllO], with resuhs satisfying the traditional Monte 

- - -' Carlo performance guarantees: small error with high probability. 

These proposed algorithms require either streaming, or having access to the entire matrix A on 
a single machine, or communicating too much data between machines. This is not feasible for very 
large m (for example m = 10^^). In such cases, A cannot be stored or streamed through a single 
machine - let alone be used in computations. For such cases, MapReduce [8] has become the de 
facto tool for handling very large datasets. 

MapReduce is a programming model for processing large data sets, typically used to do dis- 
tributed computing on clusters of commodity computers. With large amount of processing power 
at hand, it is very tempting to solve problems by brute force. However, we combine clever sampling 
techniques with the power of MapReduce to extend its utility. 

Given an m x n matrix A with each row having at most L nonzero entries, we show how to 
compute the singular values and and right singular vectors of A without dependence on m, in a 



MapReduce environment. The SVD of A is written A = UTjV'^, where U is m x n, Ti is n x n, and 
V is n X n. 

We compute S and V. We do this by first computing A^A, which we do without dependence 
on m. Since A A = V'E'^V is n x n, for small n (for example n = 10^) we can compute the 
eigen-decomposition of A^A directly and retrieve V and S. What remains is to compute A^A 
efficiently and without harming its singular values, which is what the rest of the paper is focused 
on. 

Our main result is Algorithms [3] and HI along with proven guarantees given in Theorem 14.21 
which proves a relative error bound using the spectral norm. The proof uses a new singular value 
concentration inequality from Latala [21] that has not seen much usage by the theoretical computer 
science community. 

2 Formal Preliminaries 

Label the columns of A as ci, . . . ,c„, rows as ri, . . . ,7-^, and the individual entries as Ojj. The 
matrix is stored row- by-row on disk and read via mappers. We focus on the case where each 
dimension is sparse with at most L nonzeros per row therefore the natural way to store the data is 
to segment into rows. 

We use the matrix spectral norm throughout, which for any m x n matrix A is defined as 

x^ Ay 
L4 2 = max 



x'gM'",yGR" IFII2II2/II2 

Unless otherwise denoted, the norm used anywhere in this paper is the spectral norm, which for 
regular vectors degenerates to the vector I2 norm. 

We concentrate on the regime where m is very large, e.g. m = 10^"^, but n is not too large, e.g. 
n = 10^, such that we can compute the SVD of an n x n dense matrix on a single machine. The 
magnitudes of each column is assumed to be loaded into memory and available to both the mappers 
and reducers. The magnitudes of each column are natural values to have computed already, or can 
be computed with a trivial mapreduce. 

2.1 Naive Computation 

The naive way to compute A^ A on MapReduce is to materialize all dot products between columns 
of A trivially. For purposes of demonstrating the complexity measures for MapReduce, we briefly 
write down the Naive algorithm to compute A^ A. 

Algorithm 1 NaiveMapper(rj) 
for all pairs {aij,aik) in rj do 

Emit ((cj,Cfc) -^ aijUik) 
end for 



Algorithm 2 NaiveReducer((ci, Cj), {vi, . . . , vr)) 






output cj Cj — )■ Yl 



Vi 



2.2 Complexity Measures 

There are two main complexity measures for MapReduce: "shuffle size" , and "reduce-key complex- 
ity" . These complexity measures together capture the bottlenecks when handling data on multiple 
machines: first we can't have too much communication between machines, and second we can't 
overload a single machine. The number of emissions in the map phase is called the "shuffle size" , 
since that data needs to be shuffled around the network to reach the correct reducer. The maximum 
number of items reduced to a single key is called the "reduce-key complexity" and measures how 
overloaded a single machine may become |19j . 

It can be easily seen that the naive approach for computing A^A will have 0{mL'^) emissions, 
which for the example parameters we gave {m = 10^^, n = 10^, L = 20) is infeasible. Furthermore, 
the maximum number of items reduced to a single key can be as large as m. Thus the "reduce-key 
complexity" for the naive scheme is m. 

We can drastically reduce the shuffle size and reduce-key complexity by some clever sampling 
with the DIMSUM scheme described in this paper. In this case, the output of the reducers are 
random variables whose expectations are cosine similarities i.e. normalized entries of A^A. Two 
proofs are needed to justify the effectiveness of this scheme. First, that the expectations are indeed 
correct and obtained with high probability, and second, that the shuffle size is greatly reduced. We 
prove both of these claims. In particular, in addition to correctness, we prove that for relative error 
e, the shuffle size of our scheme is only 0(n^/e^), with no dependence on the dimension m, hence 
the title of this paper. 

This means as long as there are enough mappers to read the data, our sampling scheme can be 
used to make the shuffle size tractable. Furthermore, each reduce-key gets at most 0{n/e^) values, 
thus making the reduce-key complexity tractable, too. Within Twitter Inc, we use the DIMSUM 
sampling scheme to compute similar users \1G\ ?]. We have also used the scheme to find highly 
similar pairs of words, by taking each dimension to be the indicator vector that signals in which 
tweets the word appears. We empirically verified the proven claims in this paper, but do not report 
experimental results since we are primarily focused on the proofs. 

2.3 Related Work 

Frieze et al. jTS] introduced a sampling procedure where rows and columns of A are picked with 
probabilities proportional to their squared lengths and used that to compute an approximation to 
A^A. Later [1] and [2] improved the sampling procedure. To implement these approximations to 
A^A on MapReduce one would need a shuffle size dependent on m or overload a single machine. 
We improve this to be independent of m both in shuffle size and reduce-key complexity. 

Later on [9] found an adaptive sampling scheme to improve the scheme of |18j . Since the scheme 
is adaptive, it would require too much communication between machines holding A. In particular a 
MapReduce implementation would still have shuffle size dependent on m, and require many (more 
than 1) iterations. 

There has been some effort to reduce the number of passes required through the matrix A 
using little memory, in the streaming model [6]. The question was posed by [22] to determine 
in the streaming model various linear algebraic quantities. The problem was posed again by [23] 
who asked about the time and space required for an algorithm not using too many passes. The 
streaming model is a good one if all the data can be streamed through a single machine, but with 
m so large, it is not possible to stream A through a single machine. Splitting the work of reading A 



across many mappers is the job of the MapReduce implementation and one of its major advantages 

There has been recent work specifically targeted at computing the SVD on MapReduce [3j 
in a stable manner via QR factorizations and bypassing A A, with shuffle size and reduce- key 
complexity both dependent on m. 

In addition to computing entries of A A, our sampling scheme can be used to implement many 
similarity measures. We can use the scheme to efflciently compute four similarity measures: Cosine, 
Dice, Overlap, and the Jaccard similarity measures, with details and experiments given in |251 ?], 
whereas this paper is more theoretically focused. 

3 Algorithm 

Our algorithm to compute A A efflciently is given below in Algorithms [3] and HI 
Algorithm 3 DIMSUMMapper(ri) 



for all pairs (ajj,ajfc) in rj do 
With probability 



emit ((cj,Cfc) -> aijQik) 
end for 



min (1,77 



c? K'fc 



Algorithm 4 DIMSUMReducer((ci, c^), {vi, . . . , vr)) 



if ir-TTTi — M > 1 then 



output bii -^ Ti — rAi — iT y^i_i Vi 
else 

output hij -^ i Ya=i Vi 
end if 



It is important to observe what happens if the output 'probability' is greater than 1. We 
certainly Emit, but when the output probability is greater than 1, care must be taken while reducing 
to scale by the correct factor, since it won't be correct to divide by 7, which is the usual case when 
the output probability is less than 1. Instead, the sum in Algorithm U] obtains the dot product, 
because for the pairs where the output probability is greater than 1, DIMSUMMapper effectively 
always emits. We do not repeat this point later in the paper, nonetheless it is an important one 
which arises during implementation. 

4 Correctness 

Before we move onto the correctness of the algorithm, we must state Latala's Theorem [21]. This 
theorem talks about a general model of random matrices whose entries are independent centered 
random variables with some general distribution (not necessarily normal). The largest singular 



value (the spectral norm) can be estimated by Latala's theorem for general random matrices with 
non- identically distributed entries: 

Theorem 4.1. (Latala's theorem ]21^). Let X he a random matrix whose entries Xij are indepen- 
dent centered random variables with finite fourth moment. Denoting ||^||2 o,s the matrix spectral 
norm, we have 



EllXlb < C 



max N^Exjj- 




We analyze the second and fourth central moments of the entries of the estimate for A A, 
and show that by Latala's theorem, the singular values are preserved with high probability. Let 
the matrix output by the DIMSUM algorithm be called B with entries bij. Notice that this is 
an n X n matrix of cosine similarities between columns of A. Define a diagonal matrix D with 
dii = \\ci\\. Then we can undo the cosine similarity normalization to obtain an estimate for A^A 
by using DBD. This effectively uses the cosine similarities between columns of A as an importance 
sampling scheme. We have the following theorem: 

Theorem 4.2. Let A be an m x n tall and skinny (m > n) matrix. If ^ = VL{n/e^) and D a 
diagonal matrix with entries da = \\ci\\, then the matrix B output by DIMSUM (AlgorithmslM and 
\^ satisfies, 

\\DBD-A^A\\2 

with probability at least 1/2. 



<e 



Proof. We define the indicator variable Xij^ to take value a^iakj with probability pij = 7 m — rTM — m 

I I '-'i 1 1 1 1 ^J 1 1 

on the /c'th call to DIMSUMMapper, and zero with probability 1 — pij. 

X.. = i "-kjO-kj with prob. pjj 
^■' \ with prob. 1 — pij 

Then we can write the entries of B as 



^ m 



bij — 7 ,^ijk 
^ fc=l 

Since we give relative error bounds and singular values scale trivially, we can assume A has all 
entries in [0, 1]. i.e. any scaling of the input matrix will have the same relative error guarantee. 
This assumption will be useful because we first prove an absolute error bound, then use that to 
prove a relative error bound. It should be clear from the definitions that in expectation 

E[B\ = D^^A^AD-^ and E[DBD] = A^A 

With these definitions, we now move onto bounding IE[||i? — D~^y4-^y4D~^||]. With the goal of 
invoking Latala's theorem, we analyze E[(6jj — Ebij)'^] and E[(6jj — Ebij)^]. 

Now define i^{i,j) as the number of dimensions in which c, and Cj are both nonzero, i.e. the 
number of k for which a^iakj is nonzero, and further define i n j as the set of indices for which 
d-kiO-kj is nonzero. 



Clearly, E[(6jj — Ebij)"^] is the variance of bij, which is the sum of i^{i,j) weighted indicator 
random variables. Thus we have 

E[(6,, - Ebijf] = Yav{bij] = ^ Yl Var[X,,fe] 

^ k&nj 

= T2 Z^ (^ki^kjVijO- - Pij) 

keinj 

\ ^ 2 2 

- :^ 2^ "-ki^kjPij 

k£inj 
1 V^ 2 2 '^ 



T 



k&nj 



Now by the Arithmetic-Mean Geometric-Mean inequality, 



1 v^ 2 2 /^ 1 1 \ 

= -E 



2 2 
O'ki'^kj 



keinj 



< - y] aiiO^ ■ ( -—112 ) 

' keinj Ml jM / 

1 v^ oL- 1 



Thus we have E[{bij — Ebij)'^] < -. It remains to bound the fourth central moment of bij. We 
use a counting trick to achieve this bound: 



E[{h, - Eb,,)^] = ^E 



/ J -^ijk Oiki^kjPij 



<kGinj 



■E 



r 



/ ^ k^ijq 0-qiO'qjPij)\-^ijr OiriO'rjPijjy^ijs (^siCl'SjPij )\-^ijt (ItiO'tjPij ) 

q,r,s,t£ir\j 

— J / ^ '^ [\^ijq 0,qi^qjPij)\-^ijr OjriO-rjPij )\^ijs Cl'siO'SjPij )\^ijt OitiO'tjPij )\ 
q,r,s,t£inj 

which effectively turns this into a counting problem. The terms in the sum on the last expression 
are unless either q = r = s = t, which happens #{i,j) times, or there are two pairs of matching 
indices, which happens { 2 ) (2) times. Continuing, this gives us 

= — ^ E[{Xijk - akiakjPij)^] + — E Var[Xijg]Var[Xij>] 



keinj 



q,r£inj 



= -4 XI '^tiatjiPiji'^ - Pij) + (1 -P*j)Vii] 
keinj 

^ J_ V^ 4 4 I J_ V^ 2 2 2 2 2 

k€inj g,r£inj 

1 1 V^ 4 4 I J^ 1 V^ 2 2 2 2 

2^ "fcj%- + 2 |U.||2|U.||2 Z^ O.qiO'qj'^ri'^rj 



7^ IIqIIIIciII '^ — ' ■"" ■"■' 7' llCjirilc,- , 

' ■'" k&nj ■"' q,r&nj 

by the Arithmetic-Mean Geometric-Mean inequahty, 

^ 1 / 1 I 1 N, V^ 4 4 _|_ J^ 1 V^ 2 2 2 2 

- 9^3M|^.||2 "^ |U.||2'' 2^ '^fcj'^'fci + 2 |U.||2|U.||2 Z^ O-qi^-qjO-riO-rj 
' " " " ■"' keinj II Ml II J II q^r&nj 

and since entries a^j € [0, 1], 






1 / 1 , 1 N V^ 2 2, J^ 1 V^ 

1 ■^ M I 119 """ 1 1 1 19 / / V ^ki'^kj "T 9 11 11911 119 / , 

■'; '-'7 (-"7 / '-'7'--1 



for 7 > 1, 



\ — ^ 2 \ — ^ 2 2 

- 3||„.||2 2^ ^fci + 72 ||„.||2||„.||2 2^ "gj^ri 

' " " k&r\] ' II 'II II ^11 q,rg«nj 

1 1 

^ — + — 



2 

^7 



Thus we have that IE[(6ij — Ebij)'^] < -^, and from the above we have ]E[(6ij — Ebij)'^] < -. 
Plugging these into Theorem 14.11 we can bound the absolute error between B and D~^A^AD~'^, 



E[\\B - D-'^A^AD-'^W] < Co[max ^E[{bij - Ebij) 



1/2 
\2i 



1/2 / \ 1/4 

4l 



+ max I Y, nibij - Ehjf] j + [Y. ^[(^^J' " ^^^ 

\ i J \ i,j 

/ \ 1/2 / \ 1/2 /o 2\ 1/4 

/ \ 1/2 



where Cq and Ci are absolute constants. Thus we have that 

/ \ 1/2 
E[\\B - D~^A^AD~^\\] <Ci(-] 

Setting 7 = 4:Cf^, gives 

E[\\B - D-'^A^AD^'^W] < e/2 

Thus by the Markov inequahty we have with probabihty at least 1/2, 

\\B-D~^A'^AD-^\\ <e 

Which gives us an absolute error bound between B and D~^A^AD^^. It remains to get a 
relative error bound between DBD and A A, 

WDBD-A^AW \\D(B-D~^A'^AD~'^)D\\ 



WA^AW WA^AW 

by the submultiplicative property of the spectral norm, 

^iDimW-D-'A^AD-Ml 



< 



\\ATA\\ 



Now since D is a, diagonal matrix with positive entries, its spectral norm is its largest entry, i.e. 
the largest column magnitude, call it c^,, 

cl\\B-D-'ATAD-m 



\\ATA\\ 
Now we use another property of the spectral norm to lowerbound ||^^^||, 



1^ ^11 = max 



X 



T 



A^Ay 



x,yeR" I fI Illy 1 1 

Setting X, y to be indicator vectors to pick out the i'th diagonal entry of A^A, we have that 
ll^^ll ^ c1 since c^ is some entry in the diagonal of A A. In addition to allowing us to bound 
the fourth central moment, this is yet another reason why we picked the sampling probabilities in 
Algorithm [3j Finally, continuing from above armed with this lower bound, 

WDBD-A^AW cl\\B-D^^A^AD-^\\ 



\\ATA\\ - \\ATA\\ 

< 



ch 



\\ATA\\ 
< 



de 



cl 



with probability at least 1/2. 

D 



Although we had to set 7 = VL{n/e'^) to estimate the singular values, if instead of the singular 
values we are interested in individual entries of A^ A that are large, we can get away setting 7 
significantly smaller, and thus reducing shuffle size. In particular if two columns have high cosine 
similarity, we can estimate the corresponding entry in A^ A with much less computation. Here we 
define cosine similarity as the normalized dot product 



cos (Ci,Cj) 



Theorem 4.3. For any two columns Ci and Cj having cos(cj, Cj) > e, let B he the output of DIMSUM 
with entries bij = -'^'k=i^ijk with Xijk as defined in Theorem \4-S\ Now if ^ = ^{a/e), then we 
have, 



Pr [||c,||||c,||6,, > {1 + 6)IA^A],,] < (^^^^ 



-5) 



and 



Pr [||Q||||c,-||6ij < (1 - d)[A^A],,] < exp(-a<5V2) 
Proof. We use ||cj||||cj||6ij as the estimator for [j4-^A]jj. Note that 



Hij = E[^ Xijk] = 7t 



„T, 






'J cos{x,y) > a 



Thus by the multiplicative form of the Chernoff bound, 



FT[\\ci\\\\cj\\b,j>{l + 6)[A'^A]ij] =Pr 



7' 



''^""'^■'L->7(i + 5)-t^^^]*^' 



Ci C7 



CV C-1 



Pr 



< 



.fc=i fe=i 

Similarly, by the other side of the multiplicative Chernoff bound, we have 



(l + <5)(i+5) 



Fr\\\ci\\\\cj\\b,,<il + 6)[A^AU = Pr 






\bij < 7(1 + 6) 



[A'A], 



fc=i fc=i 

< exp{-i^iij6^/2) < exp{-a5^/2) 



D 



5 Shuffle Size 

Define H as the smallest nonzero entry of A in magnitude, after the entries of A have been scaled 
to be in [0, 1]. For example when A is a 0-1 matrix, H = 1. 



Theorem 5.1. Let A be anmxn tall and skinny (m > n) sparse matrix A with at most L nonzeros 
per row. The expected shuffle size for DIMSUMMapper is 0{nL^/ H"^). 

Proof. Define il={ci,Cj) as the number of dimensions in which a and Cj are both nonzero, i.e. number 
of k for which a^iakj is nonzero. 

The expected contribution from each pair of columns will constitute the shuffle size: 

n n #(ci,Cj) 

^ ^ J]] Pr [DIMSUMMapper (ci,Cj)] 

i=l j=i+l k=l 
n n 

= ^Y^ #(ci,cj)Pr[DIMSUMMapper(ci,Cj)] 

n n ^, ^ 

V^ V^ #{Ci,Cj) 






77 



c,: 



By the Arithmetic-Mean Geometric-Mean inequality, 



i 



n .. n 

^^E^^E#(^-^^-) 

n 

^^E^^ii^''ii'/^' = 7W^' 

i=l 

The first inequality holds because of the Arithmetic-Mean Geometric-Mean inequality applied 
to {l/j|cij|, l/||cj||}. The last inequality holds because Cj can co-occur with at most ||ci|pL/i;f^ 
other columns. It is easy to see via Chernoff bounds that the above shuffle size is obtained with 
high probability. 

D 

Theorem 5.2. Let A be anmxn tall and skinny (m > n) sparse matrix A with at most L nonzeros 
per row. The shuffle size for any algorithm computing those entries of A^A for which cos{i,j) > e 
is at least 0,{nL). 

Proof. To see the lowerbound, we construct a dataset consisting of n/L distinct rows of length 
L, furthermore each row is duplicated L times. To construct this dataset, consider grouping the 
columns into n/L groups, each group containing L columns. A row is associated with every group, 
consisting of all the columns in the group. This row is then repeated L times. In each group, it is 
trivial to check that all pairs of columns have cosine similarity exactly 1. There are (2) pairs for 
each group and there are n/L groups, making for a total of (n/L) (2) = Q{nL) pairs with similarity 
1, and thus also at least e. Since any algorithm that purports to accurately calculate highly-similar 
pairs must at least output them, and there are 0,{nL) such pairs, we have the lower bound. D 

Finally it is easy to see that the largest reduce- key will have at most 0(7) values. 

10 



Theorem 5.3. The expected number of values mapped to a single key by DIMSUMMapper is 7. 

Proof. Note that the output of DIMSUMReducer is a number between and 1. Since this is 
obtained by normahzing the sum of all values reduced to the key by 7, and all summands are at 
most 1, we trivially get that the number of summands is at most 7. D 

6 Conclusions and Future Directions 

We presented the DIMSUM algorithm to compute A A for a very tall and skinny mxn matrix A. 
All of our results are provably independent of the dimension m, meaning that apart from the initial 
cost of trivially reading in the data, all subsequent operations are independent of the dimension, 
the dimension can thus be very large. 

Although we used Al A in the context of computing singular values, there are likely other linear 
algebraic quantities that can benefit from having a provably efhcient and accurate MapReduce 
implementation of A A. For example if one wishes to use the estimate for A A in solving the 
normal equations in the ubiquitous least-squares problem 

A'^Ax = A^y 

then the guarantee given by Theorem 14.21 gives some handle on the problem, although a concrete 
error bound is left for future work. 
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